gh-147991: Speed up tomllib import time #147992

Open
vstinner wants to merge 10 commits into python:main from vstinner:lazy_tomllib

@vstinner
Member

@vstinner vstinner commented Apr 2, 2026

Defer the regular expression import until the first datetime, localtime, or non-trivial number (anything other than plain decimal digits) is encountered.
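
A minimal sketch of the deferral idea, not the PR's actual code: the fast path handles plain ASCII digits without touching re, and the regex module is imported only when something else shows up. The names parse_int and _hex_re are hypothetical, and only hexadecimal is handled on the slow path to keep the sketch short.

```python
_HEX_RE = None  # compiled on first use, so "import re" is not paid at import time


def _hex_re():
    """Import re and compile the pattern the first time it is needed."""
    global _HEX_RE
    if _HEX_RE is None:
        import re  # deferred import (hypothetical placement)

        _HEX_RE = re.compile(r"0x([0-9A-Fa-f]+)")
    return _HEX_RE


def parse_int(token: str) -> int:
    # Fast path: plain decimal digits never need the regex module.
    if token.isascii() and token.isdigit():
        return int(token)
    # Slow path: anything else triggers the deferred import.
    match = _hex_re().fullmatch(token)
    if match is None:
        raise ValueError(f"unsupported number: {token!r}")
    return int(match.group(1), 16)
```

A document that only contains plain integers would never reach _hex_re(), so it would never pay for the import in this sketch.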

@vstinner
Member Author

vstinner commented Apr 2, 2026

It might be interesting to replace from types import MappingProxyType with built-in frozendict. But currently, the GitHub Action CI runs mypy with Python 3.12 which doesn't have frozendict.
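
For illustration only, here is a sketch of how code could pick frozendict when the interpreter provides it and fall back to MappingProxyType otherwise. This is an assumption about one possible approach, not what the PR does, and it does nothing for the mypy problem described above. The names _frozen and ESCAPES are hypothetical.

```python
import builtins
from types import MappingProxyType

# Use the built-in frozendict when this interpreter has one; otherwise
# fall back to a read-only MappingProxyType view over a plain dict.
_frozen = getattr(builtins, "frozendict", MappingProxyType)

ESCAPES = _frozen(
    {
        "\\b": "\u0008",  # backspace
        "\\t": "\u0009",  # tab
    }
)
```

Either way the resulting mapping rejects item assignment with a TypeError, which is the property the parser relies on for its constant tables.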

Member Author

@vstinner vstinner left a comment

I marked the added constants and functions as private by adding a _ prefix. I'm not sure it's needed: all other _parser APIs are "public" (no underscore prefix).

@hugovk
Member

hugovk commented Apr 2, 2026

It might be interesting to replace from types import MappingProxyType with built-in frozendict. But currently, the GitHub Action CI runs mypy with Python 3.12 which doesn't have frozendict.

Adding # type: ignore[name-defined] is a quick fix.

This:

diff --git a/Lib/tomllib/_parser.py b/Lib/tomllib/_parser.py
index b59d0f7d54b..96f189537cf 100644
--- a/Lib/tomllib/_parser.py
+++ b/Lib/tomllib/_parser.py
@@ -4,7 +4,7 @@
 
 from __future__ import annotations
 
-from types import MappingProxyType
+__lazy_modules__ = ["tomllib._re"]
 
 from ._re import (
     RE_DATETIME,
@@ -42,7 +42,7 @@
 KEY_INITIAL_CHARS: Final = BARE_KEY_CHARS | frozenset("\"'")
 HEXDIGIT_CHARS: Final = frozenset("abcdef" "ABCDEF" "0123456789")
 
-BASIC_STR_ESCAPE_REPLACEMENTS: Final = MappingProxyType(
+BASIC_STR_ESCAPE_REPLACEMENTS: Final = frozendict(  # type: ignore[name-defined]
     {
         "\\b": "\u0008",  # backspace
         "\\t": "\u0009",  # tab

Gets us from 4ms:

[image: before]

To ~0ms:

[image: after]

if pos >= end:
break
else:
if src[pos] != "\n":
Contributor

Can this happen? We could just return None and fall back to the original path.

Member Author

Yes, in many cases. See the added test_parse_simple_number(). Examples:

  • The test is true when parsing 1979-05-27: we cannot parse the date.
  • The test is false when parsing 1\n (e.g. value = 1\n) or 23, 24]\n (e.g. list = [23, 24]\n)
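
The two cases above can be sketched with an allow-list of terminator characters. This is a hedged illustration, not the PR's code: the contents of NUMBER_END_CHARS below are an assumption, and try_plain_integer is a hypothetical name.

```python
# Assumed set of characters that may directly follow a bare integer.
NUMBER_END_CHARS = frozenset(" \t\n,]}#")
DIGITS = "0123456789"


def try_plain_integer(src: str, pos: int):
    """Return (new_pos, value), or None to fall back to the regex path."""
    end = pos
    while end < len(src) and src[end] in DIGITS:
        end += 1
    if end == pos:
        return None  # no digits at all
    if src[pos] == "0" and end - pos > 1:
        return None  # leading zero: could be a date, a time, or an error
    if end < len(src) and src[end] not in NUMBER_END_CHARS:
        return None  # e.g. "-" in 1979-05-27: not a plain integer
    return end, int(src[pos:end])
```

With this sketch, "1979-05-27" yields None because "-" follows the digits, while "1\n" and "23, 24]\n" are parsed on the fast path.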

@vstinner
Member Author

vstinner commented Apr 2, 2026

I updated the PR to replace types.MappingProxyType with the frozendict type, thanks to the # type: ignore[name-defined] annotation (to please the mypy gods).

I ran benchmarks on the latest PR using Python built in release mode (gcc -O3) on Fedora 43:

  • According to -X importtime, with this change, import tomllib takes 828 us instead of 9.0 ms on main (10.9x faster).
  • Using python -m pyperf command with ./python -S, with this change, import tomllib takes 0.98 ms instead of 9.8 ms (10x faster).
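
For anyone who wants to repeat the first measurement, here is a rough sketch that runs a child interpreter with -X importtime and reads the cumulative time for one module from stderr. The helper name cumulative_import_time_us is hypothetical, and a single run is noisy, so treat the number as indicative only.

```python
import subprocess
import sys

PREFIX = "import time:"


def cumulative_import_time_us(module: str) -> int:
    """Return the cumulative import time (microseconds) reported for module."""
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    for line in proc.stderr.splitlines():
        if not line.startswith(PREFIX):
            continue
        # Each line looks like: "import time: <self> | <cumulative> | <name>"
        fields = line[len(PREFIX):].split("|")
        if len(fields) == 3 and fields[2].strip() == module:
            return int(fields[1])
    raise LookupError(f"{module!r} not found in -X importtime output")
```

Calling cumulative_import_time_us("tomllib") on a build with and without the change gives numbers comparable to the first bullet, assuming the interpreter ships tomllib.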

@vstinner vstinner marked this pull request as ready for review April 2, 2026 15:17
@vstinner
Member Author

vstinner commented Apr 2, 2026

Ok, the PR is now ready for review. cc @hauntsaninja @encukou

I updated the PR to use public names. I also fixed tests for hex/oct/bin numbers.

@hugovk
Member

hugovk commented Apr 2, 2026

cc also Tomli maintainer @hukkin.

@hukkin
Contributor

hukkin commented Apr 2, 2026

Hi! 👋

I have an old Tomli branch where I attempted very similar things, but it went into the discard pile. IIRC that was either because distlib's executable wrapper imports the re module, so avoiding the import didn't help (the situation today is very different because both pip and uv override the executable wrapper), or because the re module was faster at parsing integers than pure Python, so the optimization seemed case-dependent and controversial. Can't remember exactly 😄

The decimal parsing code I had was mostly as follows. I've slightly simplified it (the original parses underscored decimals too) and added comments.

# If one of these follows a "simple decimal" it could mean that
# the value is actually something else (float, datetime...) so
# optimized parsing should be abandoned.
ILLEGAL_AFTER_SIMPLE_DECIMAL: Final = frozenset(
    "eE."  # float (exponent or fraction)
    "xbo"  # hex, bin, oct
    "-"  # datetime
    ":"  # localtime
    "_0123456789"  # complex decimal
)


def try_simple_decimal(src: str, pos: Pos) -> None | tuple[Pos, int]:
    """Parse a "simple" decimal integer.

    An optimization that tries to parse a simple decimal integer
    without underscores. Returns `None` if there's any uncertainty
    on correctness.
    """
    start_pos = pos

    if src.startswith(("+", "-"), pos):
        pos += 1

    if src.startswith("0", pos):
        pos += 1
    elif src.startswith(("1", "2", "3", "4", "5", "6", "7", "8", "9"), pos):
        pos = skip_chars(src, pos, "0123456789")
    else:
        return None

    try:
        next_char = src[pos]
    except IndexError:
        next_char = None
    if next_char in ILLEGAL_AFTER_SIMPLE_DECIMAL:
        return None

    return pos, int(src[start_pos:pos])


def parse_value(
    src: str, pos: Pos, parse_float: ParseFloat, nest_lvl: int
) -> tuple[Pos, Any]:
    ...
    simple_dec_result = try_simple_decimal(src, pos)
    if simple_dec_result is not None:
        return simple_dec_result
    ...

This should behave similarly to what you have. One difference is that, at least to me, ILLEGAL_AFTER_SIMPLE_DECIMAL here seems easier to prove correct (by looking at tomllib's code) than NUMBER_END_CHARS.

I'd personally name the function something other than parse_something, simply because no other parse_ function in tomllib returns None. They all either parse successfully or raise an error.

@vstinner
Member Author

vstinner commented Apr 2, 2026

@hukkin: Hi! Oh, it's great that you already explored the "simple decimal number parser" strategy. I really like your implementation, it looks way better than mine! So I simply copy/pasted your code and I added you as a co-author.

I don't know how stdlib tomllib is maintained. Should I contribute this change to https://github.com/hukkin/tomli first? Or is it OK to land such a change in the stdlib tomllib module first?

@vstinner
Member Author

vstinner commented Apr 2, 2026

"Tests / CIFuzz / python3-libraries (address)" failed: it generated a TOML file of 1860 characters with 593 [ array opening characters and no ] array closing character. tomllib.load() fails with RecursionError. But if I use sys.setrecursionlimit(10_000), tomllib raises tomllib.TOMLDecodeError: Unclosed array (at line 1, column 1553) as expected. So it's a false alarm and I suggest ignoring it for now.
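
The same effect can be reproduced with a toy recursive-descent parser, shown here as a sketch rather than with tomllib itself (whose exact behavior on deep nesting may differ between versions). DecodeError, parse_array and outcome are hypothetical names.

```python
import sys


class DecodeError(ValueError):
    """Toy stand-in for a parser's decode error."""


def parse_array(src: str, pos: int):
    """Parse nested arrays such as "[[][]]", one stack frame per level."""
    pos += 1  # skip the opening "["
    items = []
    while True:
        if pos >= len(src):
            raise DecodeError("Unclosed array")
        if src[pos] == "]":
            return pos + 1, items
        if src[pos] != "[":
            raise DecodeError("Invalid array item")
        pos, item = parse_array(src, pos)
        items.append(item)


def outcome(src: str, limit=None) -> str:
    """Parse src, optionally under a different recursion limit."""
    old_limit = sys.getrecursionlimit()
    if limit is not None:
        sys.setrecursionlimit(limit)
    try:
        parse_array(src, 0)
    except RecursionError:
        return "RecursionError"
    except DecodeError as exc:
        return f"DecodeError: {exc}"
    finally:
        sys.setrecursionlimit(old_limit)
    return "ok"


# Nest deeper than the current limit so the first attempt must overflow.
depth = sys.getrecursionlimit() + 100
deep = "[" * depth
```

Here outcome(deep) hits the recursion limit before the parser can notice the missing brackets, while outcome(deep, limit=depth + 1000) reaches the end of input and reports the unclosed array.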

Contributor

@hauntsaninja hauntsaninja left a comment

The lazy modules change looks good to me.

It would be good to benchmark try_simple_decimal. If it's slower, I'm not sure that part of the PR is worth it... you end up saving the one-time cost of an import at the time of first load and only in a fraction of TOML documents. (If PEP 829 is accepted and there is no use for numbers in site.toml, then we can reconsider)
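
A rough way to run such a benchmark, as a sketch under stated assumptions: the pure-Python function below is a compact re-statement of the try_simple_decimal idea from this thread, and INT_RE is a hypothetical stand-in for the regex path, not tomllib's real pattern. It only times the happy path of a plain integer.

```python
import re
import timeit

DIGITS = "0123456789"
ILLEGAL_AFTER = frozenset("eE.xbo-:_" + DIGITS)
# Hypothetical stand-in regex; tomllib's actual number pattern is richer.
INT_RE = re.compile(r"[+-]?(?:0|[1-9][0-9]*)")


def try_simple_decimal(src: str, pos: int):
    start = pos
    if src.startswith(("+", "-"), pos):
        pos += 1
    if src.startswith("0", pos):
        pos += 1
    elif src.startswith(tuple("123456789"), pos):
        while pos < len(src) and src[pos] in DIGITS:
            pos += 1
    else:
        return None
    if pos < len(src) and src[pos] in ILLEGAL_AFTER:
        return None
    return pos, int(src[start:pos])


def regex_decimal(src: str, pos: int):
    match = INT_RE.match(src, pos)
    if match is None:
        return None
    return match.end(), int(match.group())


def compare(src: str = "12345\n", number: int = 100_000) -> dict:
    """Time both paths on the same input; returns seconds per path."""
    assert try_simple_decimal(src, 0) == regex_decimal(src, 0)
    return {
        "pure_python": timeit.timeit(lambda: try_simple_decimal(src, 0), number=number),
        "regex": timeit.timeit(lambda: regex_decimal(src, 0), number=number),
    }
```

Running compare() with short and long integers would show whether the result is as case-dependent as suspected earlier in the thread; pyperf would give more stable numbers than timeit.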


5 participants